I took Professor Geoffrey Hinton's neural networks class, starting on 2017-05-04. I'm summarizing the lecture slides and my own notes from the lectures, and supplementing them with other information I can find online.
Activation follows a linear function:
$y = b + \sum{x_i w_i}$ where $b$ is the bias
Activation follows sigmoid function:
$z = b + \sum{x_i * w_i}$
$y = \frac{1}{1 + e^{-z}}$
Activation is binary: on or off. The bias $b$ can be negative, acting as a threshold:
$z = b + \sum{x_i * w_i}$
$y = \begin{cases} 1 & \quad \text{if } z \geq 0\\ 0 & \quad \text{otherwise} \\ \end{cases} $
Activation has threshold and linear function beyond the threshold:
$z = b + \sum{x_i * w_i}$ (a negative bias raises the threshold at which the unit turns on)
$y = \begin{cases} z & \quad \text{if } z > 0\\ 0 & \quad \text{otherwise} \\ \end{cases} $
Activation is the probability of the unit producing a spike (a stochastic binary neuron):
$p(s=1) = \frac{1}{1 + e^{-z}}$
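A minimal numpy sketch of the activation functions above (the function names and single-neuron framing are mine):

```python
import numpy as np

def linear(x, w, b):
    return b + np.dot(x, w)                  # y = b + sum_i x_i w_i

def sigmoid(x, w, b):
    z = b + np.dot(x, w)
    return 1.0 / (1.0 + np.exp(-z))          # y = 1 / (1 + e^-z)

def binary_threshold(x, w, b):
    z = b + np.dot(x, w)
    return 1 if z >= 0 else 0                # on iff z >= 0

def relu(x, w, b):
    z = b + np.dot(x, w)
    return z if z > 0 else 0.0               # linear beyond the threshold

def stochastic_binary(x, w, b, rng=np.random.default_rng()):
    p = sigmoid(x, w, b)                     # treat the sigmoid output as p(s=1)
    return 1 if rng.random() < p else 0
```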
Meant to help generalize the model to future data. Usually used on the cross-validation set.
Overfitting can also be avoided by:
In mini-batch SGD, turn down the learning rate towards the end of learning. It's slower learning but it helps minimize the error.
Speed up mini-batch with:
Calculate a loss function while moving towards a global/local minimum. Run some data through the network, calculate the loss, update the weights, and repeat. Each round of calculating the loss and updating the weights should reduce the overall error.
To improve SGD:
Given a set of hyperparameters, create a cartesian product of the params and run a model with each member of the cartesian product.
Ex:
from itertools import product

param_a = [1, 2, 3]
param_b = ['a', 'b', 'c']
grid = product(param_a, param_b)   # (1,'a'), (1,'b'), ..., (3,'c')
for member in grid:
    run_model(member)              # assumes run_model is defined elsewhere
Calculates a squared-error loss on a continuous (linear) output.
NOTE: target = the true class label
Loss Function:
$J(w) = \frac{1}{2} \sum{ (target^i - output^i)^2 }$
Calculating Change in Weights:
$\Delta w_j = - \epsilon \frac{\partial J}{\partial w_j}$ where $\epsilon$ is the learning rate and $\frac{\partial J}{\partial w_j}$ is the partial derivative of the loss function with respect to weight $w_j$
$\Delta w_j = \epsilon \sum{(t^i - o^i)x_j^i}$
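A sketch of that batch update in numpy, under the definitions above (X holds one training case per row; the names are mine):

```python
import numpy as np

def delta_rule_step(X, t, w, epsilon):
    o = X @ w                            # linear outputs, one per training case
    return w + epsilon * X.T @ (t - o)   # delta w_j = eps * sum_i (t^i - o^i) x_j^i
```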
Stanford's lecture notes: I find those more helpful than the lecture slides for this section.
Calculates logistic loss on a binary output, penalizing very wrong outputs very strongly and not-so-wrong outputs not so strongly.
Likelihood (we maximize this):
$J(w) = \displaystyle\prod_{i=1}^{m} p(y^i \mid x^i; w)$
We want to maximize the log likelihood. This is also known as the cross entropy function.
NOTE: $h(x) = \frac{1}{1 + e^{-z}}$
$\log J(w) = \displaystyle\sum_{i=1}^{m}{y^i \log h(x^i) + (1 - y^i) \log(1 - h(x^i))} $
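A numpy sketch of that log likelihood (h holds the sigmoid outputs $h(x^i)$ and y the binary labels; the names are mine):

```python
import numpy as np

def log_likelihood(y, h):
    # sum_i [ y_i * log h_i + (1 - y_i) * log(1 - h_i) ]; training maximizes this
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```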
Used for classification of K number of classes. All values of the output sum to one. All outputs represent a probability distribution across discrete alternatives.
$ P(y = j \mid x) = \displaystyle\frac{e^{z_j}}{\sum_{k=1}^{K}{e^{z_k}}} $
$ J(w) = -\displaystyle\sum_j{t_j \log y_j} $
NOTE: lectures didn't show how to change weights
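The slides don't give the update, but for softmax with the cross-entropy loss above, the gradient with respect to the logits works out to $\frac{\partial J}{\partial z_i} = y_i - t_i$ (a standard result, not from the slides). A sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift by max(z) for numerical stability
    return e / e.sum()           # all outputs sum to one

def cross_entropy_grad(z, t):
    return softmax(z) - t        # dJ/dz_i = y_i - t_i for a one-hot target t
```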
... damps oscillations in directions of high curvature by combining gradients with opposite signs
... builds up speed in directions with a gentle but consistent gradient
$v(t) = \alpha v(t-1) - \epsilon \displaystyle\frac{\partial E}{\partial w}(t)$
!! very important !! First make a big jump in the direction of the previous accumulated gradient. Then measure the gradient where you end up and make a correction. The insight (from Nesterov, 1983) is that it's better to correct a mistake after it's been made. It yields better learning.
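A sketch of plain momentum vs. the Nesterov version (grad is an assumed callable returning $\frac{\partial E}{\partial w}$; alpha is the momentum, epsilon the learning rate):

```python
def momentum_step(w, v, grad, alpha, epsilon):
    v = alpha * v - epsilon * grad(w)              # v(t) = alpha v(t-1) - eps dE/dw
    return w + v, v

def nesterov_step(w, v, grad, alpha, epsilon):
    v = alpha * v - epsilon * grad(w + alpha * v)  # jump first, then measure and correct
    return w + v, v
```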
Use only the sign of the gradient
rprop, a full-batch version:
In full-batch learning, keep a separate step size for each weight and use only the sign of the gradient. This has the advantage of escaping quickly from plateaus with tiny gradients. Increase the step size for a weight multiplicatively iff the last two updates have the same sign (indicating we're going downhill consistently); else, decrease the step size multiplicatively (lecture suggests x 0.5).
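A per-weight numpy sketch of that rule (the 1.2 increase factor is my assumption of a typical value; the 0.5 decrease is from the notes above):

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5):
    agree = grad * prev_grad > 0                    # last two gradients had the same sign
    step = np.where(agree, step * up, step * down)  # adapt each weight's own step size
    return w - np.sign(grad) * step, step           # use only the sign of the gradient
```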
RMSprop keeps a moving average of the squared gradient for each weight:
$MeanSquare(w, t) = 0.9 MeanSquare(w, t-1) + 0.1 (\frac{\partial E}{\partial w} (t))^2$
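Per the lecture, the weight update then divides the gradient by $\sqrt{MeanSquare(w, t)}$. A sketch:

```python
import numpy as np

def rmsprop_step(w, grad, mean_sq, epsilon):
    mean_sq = 0.9 * mean_sq + 0.1 * grad ** 2   # moving average of the squared gradient
    return w - epsilon * grad / (np.sqrt(mean_sq) + 1e-8), mean_sq
```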
Wide (shallow) directions vs. steep directions. In a wide direction, we want to make a big jump. In a steep direction, we want to move towards the bottom without overshooting. Multiply by the inverse of the curvature matrix. Use conjugate gradient.
Conjugate means that as you go in the new direction, you do not change the gradients in the previous directions
Conjugate gradient is guaranteed to find the minimum of an N-dim quadratic surface because calculating the gradient of the quadratic surface moves towards a global minimum. Hand-wavey math guarantees we always move towards a better place in the next step. The steps are "conjugate" (better terminology is coupled IMHO) and coefficients are calculated to maximize the step size towards the minimum error.
Good explanation: brush up on quadratic approximations @ Khan Academy.
SGD is a first-order optimization method: it only fits a linear local approximation. HF (and conjugate gradient) are second-order methods, which supply the means to deal with curvature while seeking a minimum across the whole surface. Surfaces are curved and planes are flat. We need surface math.
In Conjugate Descent, $\alpha$ can define the step size and $\beta$ can define the direction. $\alpha$ is calculated using a line-search algo and $\beta$ is a scalar value respecting the direction of the last step (conjugate)
Predicts the next output given a sequence of previous outputs. Uses "delay taps." Reminds me of naively predicting the weather: tomorrow it will rain because it has rained the last two days in a row.
Generalize autoregressive models by using one or more layers of non-linear hidden units.
Prevent overfitting by limiting the # of hidden layers.
Combining models helps avoid overfitting and can easily extend beyond NNs.
Find models that have different biases and boost weights in different areas to enhance accuracy of overall prediction. "Mixture of experts" can model particular subsets of the data; global vs. local models.
Use dropout to improve generalization
Instead of trying to find the best single setting of the parameters (as in Maximum Likelihood or MAP), compute the full posterior distribution over all possible parameter settings.
In cases where there is little data, developing posterior estimates is expensive but helpful to avoid overfitting. Using grid search with 6 weights and 9 discrete values produces 9^6 grid-points. Lots of computation but helps direct the modeling and may improve generalization.
Amazing Fact: if we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we will get an unbiased sample from the true posterior over the weight vectors.
This is Markov Chain Monte Carlo, and it makes it possible to use full Bayesian learning with thousands of params. Learning involves allowing the models to "wander around" the weight space and explore minima (wrt the cost function). As we encourage learning to trend towards minima, but still allow exploration, we observe samples that find optimal weights.
A "surprising shortcut" for learning in models that would normally maximize the log likelihood. Does not follow a gradient but works well in practice. Useful in Boltzmann and RBMs (generative binary architectures).
$\Delta w_{ij} = \epsilon ( <v_i h_j>^0 - <v_i h_j>^1 )$
Starting at the data, the MCMC begins to favor configurations with lower energy. We can observe the direction of this favoring and interrupt learning before it completely reaches equilibrium. Interrupting is okay because we know the weights are "bad" (the weights are updating away from predicting the data). Learning happens by lowering the probability of configurations that do not result in the known visible states and raising the probability of the configurations that do result in known visible states.
Learning cancels out once the confabulations and the data have the same distribution
In practice, beginning with small weights and CD1 (1 = one full step) is quick and mostly correct. As weights grow, the effects of the MCMC are less pronounced and we can use CD3. As weights continue to grow, move up to CD10.
CD learning rule for a binary unit is the same as for a softmax.
Wikipedia has a good explanation too.
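A minimal numpy sketch of CD1 for a binary RBM, following the update rule above (biases omitted for brevity; the names are mine):

```python
import numpy as np

rng = np.random.default_rng()
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, epsilon):
    h0_p = sig(v0 @ W)                           # p(h=1 | data vector v0)
    h0 = (rng.random(h0_p.shape) < h0_p) * 1.0   # sample binary hidden states
    v1 = sig(h0 @ W.T)                           # one-step reconstruction ("confabulation")
    h1_p = sig(v1 @ W)                           # p(h=1 | reconstruction)
    # delta w_ij = eps * ( <v_i h_j>^0 - <v_i h_j>^1 )
    return W + epsilon * (np.outer(v0, h0_p) - np.outer(v1, h1_p))
```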
It's hard to learn a complicated model like a sigmoid belief network because calculating the posterior distribution over all hidden configurations is nearly intractable.
Key thought: Do inference wrong and hope that learning still works. The wrong part is to assume that the posterior over all hidden configurations factorizes into a product of distributions for each hidden unit.
For three hidden units with probability = (0.3, 0.6, 0.8)
$p(1,0,1) = 0.3 * (1 - 0.6) * 0.8$
Wake Phase: Recognition weights to reconstruct activities in each layer (moving from input layer into the network)
Sleep Phase: Generative weights to reconstruct activities in each layer, moving from the hidden layers into the input (reconstructs the input)
Flawed because:
Mode averaging - inputs with modes of (0,1) or (1,0) will result in recognition weights learning (0.5) and (0.5). A better solution would be to just pick one of (0,1) or (1,0).
For small datasets, or large datasets without much redundancy, use the full-batch method.
For big, redundant datasets, use mini-batch.
NOTE: this is a verbatim snapshot of a slide from lecture 13. I think it helps describe the practicality or feasibility of employing certain statistical techniques versus more complicated ML techniques
A very simple network architecture. Features are not learned, they're designed, and weights are learned.
Guaranteed to work (if the classes are linearly separable):
for x_i, t_i in zip(X, t):            # one pass over the training examples
    y = 1 if w @ x_i >= 0 else 0
    if y == t_i: continue             # output correct: don't change weights
    elif y == 0: w = w + x_i          # incorrectly output 0: add input vector to weights
    else:        w = w - x_i          # incorrectly output 1: subtract input vector from weights
$\partial L_{\pmb{w}}(y^{(i)}) = \begin{cases} \{ 0 \} & \quad \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} > 0\\ \{ -y^{(i)} \pmb{x}^{(i)} \} & \quad \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} < 0\\ [-1, 0] \times y^{(i)} \pmb{x}^{(i)} & \quad \text{if } \pmb{w}^\top\pmb{x}^{(i)} = 0 \\ \end{cases} $
Weights from multiple models can be averaged and produce another valid model
Can find patterns, but not patterns that "wrap-around"
Generic structure of a NN. Many special-case instances follow.
Good for processing sequences of data, e.g. speech and image recognition. Uses internal memory.
We cannot know the hidden states directly. We can only know a probability distribution over the hidden state space.
This is backprop assuming SGD:
Inputs are multiplied by weights into a hidden neuron. Hidden neurons are then multiplied by separate weights to produce either more hidden neurons or output neurons. Forward pass complete. Error is computed and then the weights in each layer (input => hidden, hidden => output) are updated.
Lectures recommend a forward pass using squashing functions (like the logistic) to prevent activity vectors from exploding. The backward pass is linear. The forward pass determines the slope of the squashing function at each neuron, which is used when backpropagating through it.
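A toy sketch of one forward/backward pass for a single logistic hidden layer with squared error (my own construction, not from the lectures):

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, epsilon):
    h = sig(x @ W1)                      # forward: input -> hidden (squashed)
    y = sig(h @ W2)                      # forward: hidden -> output
    dy = (y - t) * y * (1 - y)           # backward: error times logistic slope at output
    dh = (dy @ W2.T) * h * (1 - h)       # backward: propagate through the hidden slope
    W2 -= epsilon * np.outer(h, dy)      # update hidden -> output weights
    W1 -= epsilon * np.outer(x, dh)      # update input -> hidden weights
    return W1, W2
```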
4 Effective ways to learn an RNN:
1) LSTM 2) Hessian-Free Optimization 3) Echo State Networks, a means of initializing the connections very carefully so that the hidden state has a reservoir of weakly coupled oscillators 4) Good initialization w/ momentum
Add a penalty for changing any of the hidden activities too much (noted in HF)
Characterized by a learning procedure of repeatedly extracting features (convolutions) and pooling those features (subsampling). This process reduces the number of total features learned, thus getting around the curse of dimensionality. It also helps generalize the model.
Using McNemar's test can help diagnose the severity of errors and help understand if the network is improving with different conditions.
An implementation of an RNN with a read gate, a write gate and a keep gate.
A reservoir of hidden units whose weights are set randomly and then fixed; only the last layer (usually a linear model) is learned.
Fix the input -> hidden and hidden -> hidden connections to random values. Only learn the hidden -> output layer. Use sparse connectivity. This creates a lot of loosely coupled oscillators. This was modeled in lecture with a sine wave.
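A toy sketch under those assumptions: fix random input -> hidden and hidden -> hidden weights, run the reservoir, and fit only the hidden -> output readout (the 0.1 scaling of the reservoir weights is my stand-in for the careful initialization the lecture describes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 100
W_in  = rng.normal(size=(n_res, n_in))          # fixed input -> hidden weights
W_res = rng.normal(size=(n_res, n_res)) * 0.1   # fixed hidden -> hidden weights

def run_reservoir(inputs):
    h, states = np.zeros(n_res), []
    for u in inputs:                             # u is a length-n_in array
        h = np.tanh(W_in @ u + W_res @ h)        # loosely coupled oscillators
        states.append(h.copy())
    return np.array(states)

# Learn only hidden -> output, e.g. by least squares against targets y:
#   W_out = np.linalg.lstsq(run_reservoir(inputs), y, rcond=None)[0]
```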
A special-case RNN characterized by binary threshold neurons without any hidden units. Guaranteed to converge to a local minimum, but will sometimes converge on a false pattern. John Hopfield realized that if connections are symmetric, there is a global energy function. Each energy contribution depends on one connection weight and the binary states of the two neurons it connects.
NOTE: $s$ = state, $E$ = energy
$ E = - \sum_i{s_i b_i} - \sum_{i<j}{s_i s_j w_{ij}} $
Quadratic energy function enables the calculation of each individual unit's effect on the global energy:
$ \text{Energy gap} = \Delta E_i = E(s_i = 0) - E(s_i = 1) = b_i + \sum_j{s_j w_{ij}} $
-E = goodness
With states = 1 and -1, $\Delta w_{ij} = s_i s_j$. With states 0 and 1, $\Delta w_{ij} = 4(s_i - \frac{1}{2})(s_j - \frac{1}{2})$
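A numpy sketch of the energy, a single unit's update, and the ±1 storage rule above (W symmetric with zero diagonal, so $\sum_{i<j}{s_i s_j w_{ij}} = \frac{1}{2} s^\top W s$; the names are mine):

```python
import numpy as np

def energy(s, W, b):
    # E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij
    return -s @ b - 0.5 * s @ W @ s

def update_unit(s, W, b, i):
    gap = b[i] + W[i] @ s          # Delta E_i = E(s_i=0) - E(s_i=1)
    s[i] = 1 if gap >= 0 else 0    # adopt whichever state gives lower energy
    return s

def store_pattern(W, s):
    W = W + np.outer(s, s)         # delta w_ij = s_i s_j for +-1 states
    np.fill_diagonal(W, 0)
    return W
```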
Spurious minima are a pervasive problem. Two local minima that are close to each other in space can combine into a deeper local minimum. Difficult to overcome. Unlearning can be very effective in overcoming this shortfall. "Breaking" connections improves the robustness of the net. Elizabeth Gardner suggested cycling through the training set many times, using the perceptron convergence procedure during training.
Given a training set of binary vectors, [a BM will] fit a model that will assign a probability to every possible binary vector
$p( Model_i \mid data) = \displaystyle\frac{p(data \mid Model_i)}{\sum_j{p(data \mid Model_j)}} $
Very similar to Hopfield NN:
"Boltzmann machines can be seen as the stochastic, generative couterpart of Hopfield nets." Wikipedia
Learning happens by picking a visible state (i.e. a binary vector) and summing over all possible hidden state probabilities that could produce that visible state:
The math of a causal generative model (Boltzmann Machines are not causal):
$p(\mathbf{v}) = \displaystyle\sum_h{p(h) \, p(\mathbf{v} \mid h) }$
The probability of the observed vector and hidden units is proportional to the exponential of the negative energy of the model:
$p( \mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})}$
Joint Configuration:
$-E(\mathbf{v}, \mathbf{h}) = \displaystyle\sum_{i \in vis}{v_i b_i} + \displaystyle\sum_{k \in hidden}{h_k b_k} + \displaystyle\sum_{i \lt j}{v_i v_j w_{ij}} + \displaystyle\sum_{i,k}{v_i h_k w_{ik} } + \displaystyle\sum_{k < l}{h_k h_l w_{kl} } $
where $v_i$ is the binary state of visible unit $i$ and $b_k$ is the bias of unit $k$
Taken directly from the slide:
The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:
$ p( \mathbf{v}, \mathbf{h}) = \frac{\displaystyle\exp(-E(\mathbf{v}, \mathbf{h}))}{\displaystyle\sum_{u,g}{\exp(-E(\mathbf{u}, \mathbf{g}))}}$
The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:
$p(\mathbf{v}) = \frac{\displaystyle\sum_h{\exp(-E(\mathbf{v}, \mathbf{h}))}} {\displaystyle\sum_{u,g}{\exp(-E(\mathbf{u}, \mathbf{g}))}} $
NOTE: I think that the notation $\mathbf{v}$ indicates the notion that visible states are being held fixed and the hidden states are updating. $\mathbf{u, g}$ indicates that all nodes are being updated.
(this is the lecture where the table of all possible visible units, hidden units, Energy, exp(Energy) and probability was generated)
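A brute-force sketch that regenerates such a table for a tiny net: enumerate every global configuration, compute the energy, exponentiate, and normalize (energy_fn is an assumed callable implementing $E(\mathbf{v}, \mathbf{h})$ from the joint configuration formula above):

```python
import numpy as np
from itertools import product

def boltzmann_table(n_vis, n_hid, energy_fn):
    configs = [(np.array(bits[:n_vis]), np.array(bits[n_vis:]))
               for bits in product([0, 1], repeat=n_vis + n_hid)]
    energies = np.array([energy_fn(v, h) for v, h in configs])
    weights = np.exp(-energies)         # exp(-E) for every joint configuration
    probs = weights / weights.sum()     # normalizing term (partition function)
    return [(v, h, E, p) for (v, h), E, p in zip(configs, energies, probs)]
```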
Calculating the normalizing term is computationally infeasible, so instead we use MCMC to sample, starting from random global configurations. Continue to run the Markov chain until thermal equilibrium, at which point the probability of a global configuration is related to its energy by the Boltzmann distribution.
"Clamping" the visibile units (using training data) and allowing the hidden units to update stochastically helps the model learn the states of the best configuration of hidden units to explain the observed data.
Learning in network architecture involves maximizing the product of the probabilities that the machine assigns the binary vectors in the training set. It's equivalent to maximizing the probability that we could produce the N training cases (binary vectors) if we let the network settle to its equilibrated distribution.
$\Delta w_{ij} \propto \langle s_i s_j \rangle _{data} - \langle s_i s_j \rangle _{model}$
NOTE: $\langle s_i s_j \rangle$ is the expected value
$\displaystyle\frac{\partial log p(v) }{ \partial w_{ij}} = \langle s_i s_j \rangle _{\textbf{v}} - \langle s_i s_j \rangle _{model} $
The "positive" phase holds v constant and finds hidden node configurations that lowers the overall energy. This is done for every data vector in the training set. The "negative" phase randomoizes all nodes and finds best competing configurations and raises that configuration's energy. Inefficient, but helpful to collect the stats required for learning.
No backprop is needed in this model because the process of moving towards thermal equilibrium propagates information about the weights.
A specialized implementation of Boltzmann Machines with only one layer of hidden units and no connections between the hidden units.
This lecture was primarily focused on CD learning.
Use case of collaborative filtering was used:
Matrix factorization works well and has a certain error profile. RBMs work about as well but have a different error profile. In an ensemble, these models are combined and work very well together.
Users, Movies and Ratings (softmax output). Each user has rated N movies, and probably a different set of N movies. Instead of a global RBM, each user is modeled with their own RBM, with weights shared among all users. Movie-rating weights for rated movies are shared.
A directed acyclic graph composed of binary stochastic neurons. Like an RBM, it's a generative model where learning is focused on generating configurations that produce the inputs. Unlike an RBM, it's not energy based; it's causal.
Learning is focused on maximizing the log probability that the (binary) training vector is produced from its ancestors. The math is based on inference algos that were developed to explain graphical models.
$p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp(-b_i - \displaystyle\sum_j{s_j w_{ji}})} $
$\Delta w_{ji} = \epsilon * s_j (s_i - p_i)$
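A sketch of that update for a single unit given sampled binary states of its parents (the names are mine):

```python
import numpy as np

def sbn_update(s_parents, s_i, b_i, w_ji, epsilon):
    p_i = 1.0 / (1.0 + np.exp(-b_i - s_parents @ w_ji))  # p(s_i = 1 | parents)
    return w_ji + epsilon * s_parents * (s_i - p_i)      # delta w_ji = eps * s_j (s_i - p_i)
```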
Complicated learning routine because one must sample from a posterior that is not factorial (it's not factorial because hidden nodes that are independent become dependent when connected in the NN and are both used to influence the outcome of the visible node). All weights interact: the posterior depends on the prior plus the likelihood of the parent layers.
A generative graphical model that combines unsupervised learning (to act as feature detectors) with supervised learning (to classify).
Learning happens by linking unsupervised NNs (RBMs etc.) together, applying contrastive wake-sleep (or CD, according to Wikipedia) at each step going "up" the network (generation), and then backprop on the way down (discrimination).
A means of doing PCA-type work in NN architectures.
An example is stacking 4 RBMs from a set of inputs: i => W_1 => h_1 => W_2 => h_2 => W_3 => h_3 => W_4 => stack of n linear units (the code) => W_4^T => h_5 => W_3^T => h_6 => W_2^T => h_7 => W_1^T => output (roll and unroll the network)
A NN is trained to reproduce its input as its output, but the middle hidden layer is a much lower-dimensional array of units. That small number of units becomes a good way to compare different things (like documents and images).
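A toy sketch of the roll/unroll idea with tied (transposed) weights: squash through the encoder, keep the small code layer linear, and decode back with the transposes in reverse order (my own construction, under the 4-RBM stacking above):

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def autoencode(x, weights):
    h = x
    for W in weights[:-1]:
        h = sig(h @ W)               # roll: squash down through W_1 .. W_3
    code = h @ weights[-1]           # small linear code layer (W_4)
    h = code
    for W in reversed(weights):
        h = sig(h @ W.T)             # unroll: decode with the transposes
    return code, h                   # h approximates the input x
```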